This notebook explores the more recent data from NYC Open Data Data Set.
This dataset can also be reached and interacted with through its Google BigQuery location
Try a different version: handout page, Tab Navigation
Reduce Car Accidents in Brooklyn
For this exercise, we’d like you to analyze data on New York motor vehicle collisions and answer the following question:
What are your ideas for reducing accidents in Brooklyn?
Imagine you are preparing this presentation for the city council who will use it to inform new legislation and/or projects.
Briefly:
Libraries that will be used during exploration
library(magrittr)
library(dplyr)
library(ggplot2)
library(viridis)
library(plotly)
library(maps)
library(rgeos)
library(rgdal)
library(ggthemes)
library(crosstalk)
library(leaflet)
library(d3scatter)
library(d3heatmap)
library(rnoaa)
library(DT)
#library(ggmap)
Collect API tokens in Environment Variables (purposefully kept hidden here). Tokens and keys used include Google Maps API key (get one here, Mapbox Access Token(get one here) and an NCDC token (here) for NOAA weather data.
Load data from /data directory and into memory
dt <- read.csv(file = "data/NYPD_Motor_Vehicle_Collisions.csv")
Inspect structure of dataset with the str() command:
datatable(str(dt))
## 'data.frame': 990800 obs. of 29 variables:
## $ DATE : Factor w/ 1709 levels "01/01/2013","01/01/2014",..: 200 200 200 200 1068 920 611 200 65 65 ...
## $ TIME : Factor w/ 1440 levels "0:00","0:01",..: 556 631 632 644 661 936 46 686 1051 1066 ...
## $ BOROUGH : Factor w/ 6 levels "","BRONX","BROOKLYN",..: 1 2 2 3 1 1 1 1 3 3 ...
## $ ZIP.CODE : int NA 10454 10466 11218 NA NA NA NA 11218 11236 ...
## $ LATITUDE : num 40.7 40.8 40.9 40.6 40.7 ...
## $ LONGITUDE : num -73.9 -73.9 -73.9 -74 -73.9 ...
## $ LOCATION : Factor w/ 90272 levels "","(0.0, 0.0)",..: 28545 73165 89328 17808 52600 1 1 14824 17763 18379 ...
## $ ON.STREET.NAME : Factor w/ 9151 levels "","?EST 125 STREET",..: 1420 1 3493 1 1 1 6598 4069 738 7071 ...
## $ CROSS.STREET.NAME : Factor w/ 9585 levels "","0","01247",..: 1 1 9364 1 1 1 7399 3638 114 4201 ...
## $ OFF.STREET.NAME : Factor w/ 59908 levels "","(26 BROOKLYN TERMINAL MARKET LOT)",..: 1 38225 1 29898 1 1 1 1 1 1 ...
## $ NUMBER.OF.PERSONS.INJURED : int 0 0 1 0 0 0 0 0 1 2 ...
## $ NUMBER.OF.PERSONS.KILLED : int 0 0 0 0 0 0 0 0 0 0 ...
## $ NUMBER.OF.PEDESTRIANS.INJURED: int 0 0 1 0 0 0 0 0 0 0 ...
## $ NUMBER.OF.PEDESTRIANS.KILLED : int 0 0 0 0 0 0 0 0 0 0 ...
## $ NUMBER.OF.CYCLIST.INJURED : int 0 0 0 0 0 0 0 0 0 0 ...
## $ NUMBER.OF.CYCLIST.KILLED : int 0 0 0 0 0 0 0 0 0 0 ...
## $ NUMBER.OF.MOTORIST.INJURED : int 0 0 0 0 0 0 0 0 1 2 ...
## $ NUMBER.OF.MOTORIST.KILLED : int 0 0 0 0 0 0 0 0 0 0 ...
## $ CONTRIBUTING.FACTOR.VEHICLE.1: Factor w/ 49 levels "","Accelerator Defective",..: 10 47 47 47 47 11 1 43 43 43 ...
## $ CONTRIBUTING.FACTOR.VEHICLE.2: Factor w/ 49 levels "","Accelerator Defective",..: 47 1 1 47 47 47 1 47 47 47 ...
## $ CONTRIBUTING.FACTOR.VEHICLE.3: Factor w/ 43 levels "","Accelerator Defective",..: 1 1 1 1 1 1 1 1 42 1 ...
## $ CONTRIBUTING.FACTOR.VEHICLE.4: Factor w/ 42 levels "","Accelerator Defective",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ CONTRIBUTING.FACTOR.VEHICLE.5: Factor w/ 31 levels "","Aggressive Driving/Road Rage",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ UNIQUE.KEY : int 3612721 3612791 3618743 3614471 3284922 2833714 336679 3618925 3598095 3597360 ...
## $ VEHICLE.TYPE.CODE.1 : Factor w/ 18 levels "","AMBULANCE",..: 15 10 12 15 10 10 1 10 10 15 ...
## $ VEHICLE.TYPE.CODE.2 : Factor w/ 18 levels "","AMBULANCE",..: 10 1 1 10 16 10 1 10 10 15 ...
## $ VEHICLE.TYPE.CODE.3 : Factor w/ 18 levels "","AMBULANCE",..: 1 1 1 1 1 1 1 1 15 1 ...
## $ VEHICLE.TYPE.CODE.4 : Factor w/ 18 levels "","AMBULANCE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ VEHICLE.TYPE.CODE.5 : Factor w/ 16 levels "","AMBULANCE",..: 1 1 1 1 1 1 1 1 1 1 ...
Inspect summary of dataset with summary() command:
datatable(summary(dt))
Our Dataset structure revealed the variables and their classes sapply(names(dt), function(x) paste0(x, ' is class: ', class(dt[[x]]))) =
## DATE
## "DATE is class: factor"
## TIME
## "TIME is class: factor"
## BOROUGH
## "BOROUGH is class: factor"
## ZIP.CODE
## "ZIP.CODE is class: integer"
## LATITUDE
## "LATITUDE is class: numeric"
## LONGITUDE
## "LONGITUDE is class: numeric"
## LOCATION
## "LOCATION is class: factor"
## ON.STREET.NAME
## "ON.STREET.NAME is class: factor"
## CROSS.STREET.NAME
## "CROSS.STREET.NAME is class: factor"
## OFF.STREET.NAME
## "OFF.STREET.NAME is class: factor"
## NUMBER.OF.PERSONS.INJURED
## "NUMBER.OF.PERSONS.INJURED is class: integer"
## NUMBER.OF.PERSONS.KILLED
## "NUMBER.OF.PERSONS.KILLED is class: integer"
## NUMBER.OF.PEDESTRIANS.INJURED
## "NUMBER.OF.PEDESTRIANS.INJURED is class: integer"
## NUMBER.OF.PEDESTRIANS.KILLED
## "NUMBER.OF.PEDESTRIANS.KILLED is class: integer"
## NUMBER.OF.CYCLIST.INJURED
## "NUMBER.OF.CYCLIST.INJURED is class: integer"
## NUMBER.OF.CYCLIST.KILLED
## "NUMBER.OF.CYCLIST.KILLED is class: integer"
## NUMBER.OF.MOTORIST.INJURED
## "NUMBER.OF.MOTORIST.INJURED is class: integer"
## NUMBER.OF.MOTORIST.KILLED
## "NUMBER.OF.MOTORIST.KILLED is class: integer"
## CONTRIBUTING.FACTOR.VEHICLE.1
## "CONTRIBUTING.FACTOR.VEHICLE.1 is class: factor"
## CONTRIBUTING.FACTOR.VEHICLE.2
## "CONTRIBUTING.FACTOR.VEHICLE.2 is class: factor"
## CONTRIBUTING.FACTOR.VEHICLE.3
## "CONTRIBUTING.FACTOR.VEHICLE.3 is class: factor"
## CONTRIBUTING.FACTOR.VEHICLE.4
## "CONTRIBUTING.FACTOR.VEHICLE.4 is class: factor"
## CONTRIBUTING.FACTOR.VEHICLE.5
## "CONTRIBUTING.FACTOR.VEHICLE.5 is class: factor"
## UNIQUE.KEY
## "UNIQUE.KEY is class: integer"
## VEHICLE.TYPE.CODE.1
## "VEHICLE.TYPE.CODE.1 is class: factor"
## VEHICLE.TYPE.CODE.2
## "VEHICLE.TYPE.CODE.2 is class: factor"
## VEHICLE.TYPE.CODE.3
## "VEHICLE.TYPE.CODE.3 is class: factor"
## VEHICLE.TYPE.CODE.4
## "VEHICLE.TYPE.CODE.4 is class: factor"
## VEHICLE.TYPE.CODE.5
## "VEHICLE.TYPE.CODE.5 is class: factor"
The first thing to come to mind with such a factor heavy dataset is counting. Factors are not a hodgepodge collection of values, observed as they are pick up off the ground. Each Level of a factor - ideally - should have been intentionally designed. Ordered and distributed according to a purpose greater than the unit. Though not statically related as quantifiable assets, the levels in a factor are each related to one and other in addition to the group as a whole.
There are a lot of empty cells. To make sure we use a universal value for blank or Not Available we will assign the value NA to all blank cells. While munging around with the data, add what could be a valuable variable created from two current variables. The DATE and TIME variables are set up as factors. This has interesting categorical value so they will stay in the data. Rather than replace the two rows, add a third one in a POSIX date form.
dt[dt == ''] <- NA
# create date.time column for time use
#dt$date.time <- as.Date(dt$DATE, format("%m/%d/%Y"))
# use hours & min in date.time var for calculations
dt <- within(dt, {date.time = as.POSIXct(strptime(paste(DATE, TIME), "%m/%d/%Y %H:%M"))})
With Latitude and Longitude present and appearing to be fairly well documented, let’s take a quick look at how these accidents look over an interactive world map (incase of mistakes outlying somewhere aside from New York). We will use the BOROUGH variable as a factor. This gives the geographic association of each borough and allows us early forsight into anything specific about our point of interest BOURGH == "BROOKLYN"
mp <- dt %>%
plot_mapbox(lat = ~LATITUDE, lon = ~LONGITUDE,
split = ~BOROUGH, mode = 'scattermapbox') %>%
layout(mapbox = list(zoom = 9,
center = list(lat = ~(40.7), lon = ~(-74.0))),
title = "All Events Across NYC")
plotly_build(mp)